A statistical information extraction system for Turkish
نویسندگان
چکیده
This paper presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. In languages like English, there is a very small number of possible word forms with a given root word. However, languages like Turkish have very productive agglutinative morphology. Thus, it is an issue to build statistical models for specific tasks using the surface forms of the words, mainly because of the data sparseness problem. In order to alleviate this problem, we used additional syntactic information, i.e. the morphological structure of the words. We have successfully applied statistical methods using both the lexical and morphological information to sentence segmentation, topic segmentation, and name tagging tasks. For sentence segmentation, we have modeled the final inflectional groups of the words and combined it with the lexical model, and decreased the error rate to 4.34%, which is 21% better than the result obtained using only the surface forms of the words. For topic segmentation, stems of the words (especially nouns) have been found to be more effective than using the surface forms of the words and we have achieved 10.90% segmentation error rate on our test set according to the weighted TDT-2 segmentation cost metric. This is 32% better than the word-based baseline model. For name tagging, we used four different information sources to model names. Our first information source is based on the surface forms of the words. Then we combined the contextual cues with the lexical model, and obtained some improvement. After this, we modeled the morphological analyses of the words, and finally we modeled the tag sequence, and reached an F-Measure of 91.56%, according to the MUC evaluation criteria. Our results are important in the sense that, using linguistic information, i.e. morphological analyses of the words, and a corpus large enough to train a statistical model significantly improves these basic information extraction tasks for Turkish.
منابع مشابه
Developing a Concept Extraction System for Turkish
In recent years, due to the vast amount of available electronic media and data, the necessity of analyzing electronic documents automatically was increased. In order to assess if a document contains valuable information or not, concepts, key phrases or main idea of the document have to be known. There are some studies on extracting key phrases or main ideas of documents for Turkish. However, to...
متن کاملName Tagging Using Lexical, Contextual, and Morphological Information
Abstract This paper presents a probabilistic model for automatically tagging names in a Turkish text. We used four different information sources to model names, and successfully combined them. Our first information source is based on the surface forms of the words. Then we combined the contextual cues with the lexical model, and obtained a significant improvement. After this, we modeled the mor...
متن کاملDesign and evaluation of an ontology based information extraction system for radiological reports
This paper describes an information extraction system that extracts and converts the available information in free text Turkish radiology reports into a structured information model using manually created extraction rules and domain ontology. The ontology provides flexibility in the design of extraction rules, and determines the information model for the extracted semantic information. Although...
متن کاملThe Use of Remote Sensing and GIS Technologies for Comprehensive Wastewater Management
In the present study, it was aimed at combining remote sensing technology, geographic information system and statistical data in order to provide a rather rapid, sensitive and comprehensive overview of current situation of Turkish river basins in terms of existing spatial data. For the aim of the study, all statistical information gathered from the national authorities on regional basis was ove...
متن کاملA Turkish Handprint Character Recognition System
This paper presents a study for recognizing isolated Turkish handwritten uppercase letters. In the study, first of all, a Turkish Handprint Character Database has been created from the students in Istanbul Technical University (ITU). There are about 20000 uppercase and 7000 digit samples in this database. Several feature extraction and classification techniques are realized and combined to find...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Natural Language Engineering
دوره 9 شماره
صفحات -
تاریخ انتشار 2003